install.packages("nycflights13")
Change “your name” in the YAML header above to your name.
As usual, enter the examples in code chunks and run them, unless told otherwise.
Read R4ds Chapter 10: Tibbles, sections 1-3.
library(tidyverse)
Create a tibble from a data frame.
as_tibble(iris)
This chunk shows the sepal length and width, the petal length and width, and the species of each iris.
Tibble with individual vectors.
tibble(x = 1:5, y = 1, z = x ^ 2 + y)
Use of backticks to create column names.
tb <- tibble(`:)` = "smile", ` ` = "space",`2000` = "number")
tb
Tribble
tribble(
~x, ~y, ~z,
#--|--|----
"a", 2, 3.6,
"b", 1, 8.5
)
Tibble Print
tibble(
a = lubridate::now() + runif(1e3) * 86400,
b = lubridate::today() + runif(1e3) * 30,
c = 1:1e3,
d = runif(1e3),
e = sample(letters, 1e3, replace = TRUE))
Explicit tibble print
nycflights13::flights %>%
print(n = 10, width = Inf)
View nycflights13 data set
nycflights13::flights %>%
View()
Create a tibble
df <- tibble(
x = runif(5),
y = rnorm(5))
Extract a variable by name
df$x
[1] 0.4078880 0.3243437 0.1120017 0.7956879 0.6051586
Extract a variable by name
df[["x"]]
[1] 0.4078880 0.3243437 0.1120017 0.7956879 0.6051586
Extract a variable by position
df[[1]]
[1] 0.4078880 0.3243437 0.1120017 0.7956879 0.6051586
Extract a variable by name with a pipe
df %>% .$x
Extract a variable by name with a pipe, using [[
df %>% .[["x"]]
[1] 0.4078880 0.3243437 0.1120017 0.7956879 0.6051586
Answer the questions completely. Use code chunks, text, or both, as necessary.
1: How can you tell if an object is a tibble? (Hint: try printing mtcars, which is a regular data frame). Identify at least two ways to tell if an object is a tibble.
A printed tibble starts with a header line such as "# A tibble: 32 x 11", shows only the first ten rows and the columns that fit on screen, and lists each column's type under its name. A regular data frame such as mtcars prints every row and has no header. You can also call is_tibble(), which returns TRUE only for tibbles.
Hint: What does as_tibble() do? It converts an existing data frame into a tibble. What does class() do? It returns the class of an object; for a tibble that is "tbl_df", "tbl", "data.frame". What does str() do? It shows the basic structure of an object.
mtcars
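The checks described above can be sketched in a few lines. This is only an illustration; cars_tbl is a name made up for the example.

```r
library(tibble)

# mtcars is a plain data frame; as_tibble() returns a tibble copy of it.
cars_tbl <- as_tibble(mtcars)

is_tibble(mtcars)     # FALSE
is_tibble(cars_tbl)   # TRUE

class(mtcars)         # "data.frame"
class(cars_tbl)       # "tbl_df" "tbl" "data.frame"
```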
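2: Compare and contrast the following operations on a data.frame and equivalent tibble. What is different? Why might the default data frame behaviours cause you frustration?
With a data frame, $ does partial matching, so df$x silently returns the xyz column; a tibble returns NULL with a warning because no column is named x. With a data frame, [ returns a vector when one column is selected (df[, "xyz"]) but a data frame when several are selected (df[, c("abc", "xyz")]); with a tibble, [ always returns a tibble. The data frame defaults are frustrating because a typo in a column name can silently return the wrong column, and code that uses [ has to handle two different return types.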
df <- data.frame(abc = 1, xyz = "a")
df$x
[1] "a"
df[, "xyz"]
[1] "a"
df[, c("abc", "xyz")]
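To see the contrast side by side, here is a small sketch that runs the same operations on a data frame and on an equivalent tibble. The object names (tb, df_partial, and so on) are chosen for the example only.

```r
library(tibble)

df <- data.frame(abc = 1, xyz = "a")
tb <- tibble(abc = 1, xyz = "a")

# `$` on a data frame partially matches "x" to the column xyz.
df_partial <- df$x
# A tibble never partially matches: it returns NULL and warns.
tb_partial <- suppressWarnings(tb$x)

# `[` with a single column: a vector from the data frame, a tibble from the tibble.
df_one <- df[, "xyz"]
tb_one <- tb[, "xyz"]

is.data.frame(df_one)   # FALSE
is.data.frame(tb_one)   # TRUE
```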
Read R4ds Chapter 11: Data Import, sections 1, 2, and 5.
Nothing to do here unless you took a break and need to reload tidyverse.
Do not run the first code chunk of this section, which begins with heights <- read_csv("data/heights.csv"). You do not have that data file so the code will not run.
Enter and run the remaining chunks in this section.
Produces an inline csv file
read_csv("a,b,c
1,2,3
4,5,6")
Create a csv file but skip the first lines
read_csv("The first line of metadata
The second line of metadata
x,y,z
1,2,3", skip = 2)
Create csv file and skip a comment.
read_csv("# A comment I want to skip
x,y,z
1,2,3", comment = "#")
Create a csv file that doesn’t have column names on data
read_csv("1,2,3\n4,5,6", col_names = FALSE)
Create a csv file and assign column names a vector
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
Create csv file and add na to missing data
read_csv("a,b,c\n1,2,.", na = ".")
1: What function would you use to read a file where fields were separated with “|”? read_delim(), with delim = "|".
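A quick way to try this is to write a small pipe-separated file to a temporary path and read it back. The file contents and the name piped are invented for this sketch.

```r
library(readr)

# Hypothetical pipe-separated file, written to a temporary path.
path <- tempfile(fileext = ".txt")
writeLines(c("x|y|z", "1|2|3", "4|5|6"), path)

# delim tells read_delim() which character separates the fields.
piped <- read_delim(path, delim = "|")
piped
```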
2: (This question is modified from the text.) Finish the two lines of read_delim code so that the first one would read a comma-separated file and the second would read a tab-separated file. You only need to worry about the delimiter. Do not worry about other arguments. Replace the dots in each line with the rest of your code.
file <- read_delim("file.csv", delim = ",")
file <- read_delim("file.csv", delim = "\t")
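3: What are the two most important arguments to read_fwf()? Why? The two most important arguments are file and col_positions. file says what to read. col_positions says where each column begins and ends, which read_fwf() cannot work out from a delimiter because a fixed-width file has none; it is usually built with fwf_widths() or fwf_positions().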
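As a rough illustration, the sketch below reads a made-up fixed-width file in which the first two characters of each line are an id and the next three are a code. The file contents and column names are assumptions for the example.

```r
library(readr)

# Hypothetical fixed-width file: 2 characters of id, then 3 characters of code.
path <- tempfile(fileext = ".txt")
writeLines(c("12abc", "34def"), path)

# fwf_widths() turns column widths (and optional names) into column positions.
fixed <- read_fwf(
  path,
  col_positions = fwf_widths(c(2, 3), col_names = c("id", "code"))
)
fixed
```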
4: Skip this question
5: Identify what is wrong with each of the following inline CSV files. What happens when you run the code?
read_csv("a,b\n1,2,3\n4,5,6")
2 parsing failures.
row col expected actual file
1 -- 2 columns 3 columns literal data
2 -- 2 columns 3 columns literal data
read_csv("a,b,c\n1,2\n1,2,3,4")
2 parsing failures.
row col expected actual file
1 -- 3 columns 2 columns literal data
2 -- 3 columns 4 columns literal data
read_csv("a,b\n\"1")
2 parsing failures.
row col expected actual file
1 a closing quote at end of file literal data
1 -- 2 columns 1 columns literal data
read_csv("a,b\n1,2\na,b")
read_csv("a;b\n1;3")
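read_csv("a,b\n1,2,3\n4,5,6"): the header names only two columns but each row has three values, so the parsing failures above report 3 columns where 2 were expected and the third value in each row is lost.
read_csv("a,b,c\n1,2\n1,2,3,4"): the header names three columns, but the first row has only two values (c is filled with NA) and the second row has four (the extra value is lost).
read_csv("a,b\n\"1"): the opening quotation mark is never closed, and the row has only one value, so b is NA.
read_csv("a,b\n1,2\na,b"): nothing fails, but each column mixes numbers and letters, so both columns are read as character.
read_csv("a;b\n1;3"): read_csv() splits on commas, not semicolons, so the data is read as a single column named "a;b". Use read_csv2() for semicolon-separated data.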
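For the semicolon case, a small comparison shows the difference between read_csv() and read_csv2(). The object names are made up for this sketch.

```r
library(readr)

# read_csv() expects commas, so the whole line becomes one column.
with_comma <- read_csv("a;b\n1;3")
ncol(with_comma)        # 1

# read_csv2() expects semicolons as the field separator.
with_semicolon <- read_csv2("a;b\n1;3")
names(with_semicolon)   # "a" "b"
```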
Just read this section. You may find it helpful in the future to save a data file to your hard drive. It is basically the same format as reading a file, except that you must specify the data object to save, in addition to the path and file name.
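Writing a file can be sketched as a round trip: save a tibble with write_csv(), then read it back with read_csv(). The tibble scores and the temporary path are invented for the example; in practice you would supply your own data object and file name.

```r
library(readr)
library(tibble)

# Made-up data to save.
scores <- tibble(name = c("a", "b"), score = c(3.6, 8.5))

# write_csv() takes the data object first, then the path to write to.
path <- tempfile(fileext = ".csv")
write_csv(scores, path)

# Reading the file back recovers the same values.
read_back <- read_csv(path)
read_back
```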
Read R4ds Chapter 18: Pipes, sections 1-3.
Nothing to do otherwise for this chapter. Is this easy or what?
Note: Try using pipes for all of the remaining examples. That will help you understand them.
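As a small illustration of what the pipe buys you, the sketch below writes the same three steps twice, once as nested calls and once with %>%. It uses the built-in mtcars data so it runs on its own; the names nested and piped are just for the example.

```r
library(dplyr)

# Nested calls read inside-out.
nested <- head(arrange(filter(mtcars, cyl == 4), desc(mpg)), 3)

# The pipe reads top to bottom: filter, then sort, then keep three rows.
piped <- mtcars %>%
  filter(cyl == 4) %>%
  arrange(desc(mpg)) %>%
  head(3)

identical(nested, piped)   # TRUE
```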
Read R4ds Chapter 12: Tidy Data, sections 1-3, 7.
Nothing to do here unless you took a break and need to reload the tidyverse.
Study Figure 12.1 and relate the diagram to the three rules listed just above them. Relate that back to the example I gave you in the notes. Bear this in mind as you make data tidy in the second part of this assignment.
You do not have to run any of the examples in this section.
Read and run the examples through section 12.3.1 (gathering), including the example with left_join(). We’ll cover joins later.
Table 4a dataset
table4a
Pivot the year columns of table4a into the variables year and cases
table4a %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
Pivot the year columns of table4b into the variables year and population
table4b %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
Tidy table 4a
tidy4a <- table4a %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
Tidy table 4b
tidy4b <- table4b %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
Left join tidy4a and tidy4b.
left_join(tidy4a, tidy4b)
Joining, by = c("country", "year")
Load Table 2
table2
Use pivot_wider() to spread the values of type into their own columns
table2 %>%
pivot_wider(names_from = type, values_from = count)
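2: Why does this code fail? The column names 1999 and 2000 are non-syntactic, so they must be wrapped in backticks. Without them, 1999 and 2000 are read as column positions, and table4a has only three columns. Fix it so it works. The fixed chunk also replaces gather() with the newer pivot_longer().
Original Chunk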
table4a %>%
gather(1999, 2000, key = "year", value = "cases")
#> Error in inds_combine(.vars, ind_list): Position must be between 0 and n
Fixed Chunk
table4a %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
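The same point can be checked on a tiny made-up table. In this sketch, cases_wide is a hypothetical stand-in for table4a; it shows that backticked names and quoted strings both refer to the columns by name.

```r
library(tidyr)
library(tibble)

# Hypothetical wide table with non-syntactic column names.
cases_wide <- tribble(
  ~country, ~`1999`, ~`2000`,
  "A",      10,      20,
  "B",      30,      40
)

# Backticks refer to the columns by name.
with_backticks <- cases_wide %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")

# Quoted strings are also treated as names, not positions.
with_strings <- cases_wide %>%
  pivot_longer(c("1999", "2000"), names_to = "year", values_to = "cases")

identical(with_backticks, with_strings)   # TRUE
```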
That is all for Chapter 12. On to the last chapter.
Read R4ds Chapter 5: Data Transformation, sections 1-4.
Time to get small.
Load the necessary libraries. As usual, type the examples into and run the code chunks.
library(tidyverse)
library(nycflights13)
Loading Flights
flights
filter()
Study Figure 5.1 carefully. Once you learn the &, |, and ! logic, you will find them to be very powerful tools.
Filter flights by month and day
filter(flights, month == 1, day == 1)
Save the results for 01/01
jan1 <- filter(flights, month == 1, day == 1)
Save and print the results of flights on 12/25
(dec25 <- filter(flights, month == 12, day == 25))
Error from using = instead of ==
filter(flights, month = 1)
Error: Problem with `filter()` input `..1`.
x Input `..1` is named.
ℹ This usually means that you've used `=` instead of `==`.
ℹ Did you mean `month == 1`?
Floating-point arithmetic makes == unreliable
sqrt(2) ^ 2 == 2
[1] FALSE
Use near() instead of ==
near(sqrt(2) ^ 2, 2)
[1] TRUE
Another use of near()
near(1 / 49 * 49, 1)
[1] TRUE
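To make the behaviour of near() less mysterious, this sketch compares it with the tolerance check it performs: the absolute difference must be smaller than the square root of machine epsilon, which is near()'s default tolerance.

```r
library(dplyr)

x <- sqrt(2) ^ 2

x == 2                                   # FALSE: the stored value is not exactly 2
near(x, 2)                               # TRUE
abs(x - 2) < .Machine$double.eps^0.5     # TRUE: the same check written out
```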
All flights that departed in Nov or Dec
filter(flights, month == 11 | month == 12)
Shorthand to find all Nov and Dec flights
nov_dec <- filter(flights, month %in% c(11, 12))
Flights that weren’t delayed by more than two hours.
filter(flights, !(arr_delay > 120 | dep_delay > 120))
Flights that weren’t delayed by more than 2 hours
filter(flights, arr_delay <= 120, dep_delay <= 120)
Create a tibble with a missing value
df <- tibble(x = c(1, NA, 3))
filter() keeps only rows where the condition is TRUE, so the NA row is dropped
filter(df, x > 1)
Keep the missing values by asking for them explicitly
filter(df, is.na(x) | x > 1)
1.1: Find all flights with a delay of 2 hours or more.
filter(flights, dep_delay >= 120)
1.2: Flew to Houston (IAH or HOU)
filter(flights, dest == "IAH" | dest == "HOU")
1.3: Were operated by United (UA), American (AA), or Delta (DL).
filter(flights, carrier == "UA" | carrier == "AA" | carrier == "DL")
1.4: Departed in summer (July, August, and September).
filter(flights, month == 7 | month == 8 | month == 9)
1.5: Arrived more than two hours late, but didn’t leave late.
filter(flights, dep_delay <= 0 & arr_delay > 120)
1.6: Were delayed by at least an hour, but made up over 30 minutes in flight. This is a tricky one. Do your best.
filter(flights, dep_delay >= 60 & dep_delay - arr_delay > 30)
1.7: Departed between midnight and 6am (inclusive)
filter(flights, dep_time <= 600 | dep_time == 2400)
2: Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?
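between(x, left, right) is a shortcut for x >= left & x <= right. Yes: 1.4 could have been written with between(month, 7, 9). For 1.7, between() covers the times up to 600, but midnight is recorded in flights as 2400, so it still needs its own condition.
filter(flights, between(month, 7, 9))
filter(flights, between(dep_time, 1, 600) | dep_time == 2400)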
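The equivalence can be checked on a handful of made-up departure times in the HHMM form that flights uses. The vector dep_time below is invented for this sketch, including a 2400 value to stand for midnight.

```r
library(dplyr)

# Made-up departure times in HHMM form; 2400 stands for midnight.
dep_time <- c(1, 559, 600, 601, 2400)

# between() is inclusive at both ends.
in_range <- between(dep_time, 1, 600)
by_hand  <- dep_time >= 1 & dep_time <= 600
identical(in_range, by_hand)   # TRUE

# Midnight falls outside the range and needs its own condition.
early <- between(dep_time, 1, 600) | dep_time == 2400
early
```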
3: How many flights have a missing dep_time? 8255 flights. What other variables are missing? Departure delay, arrival time, arrival delay, and air time. What might these rows represent? Most likely cancelled flights, which never departed.
sum(is.na(flights$dep_time))
[1] 8255
filter(flights, is.na(dep_time))
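4: Why is NA ^ 0 not missing? Any number raised to the power of zero is 1, so the result is 1 whatever the missing value is. Why is NA | TRUE not missing? Anything ‘or TRUE’ is always TRUE. Why is FALSE & NA not missing? Anything ‘and FALSE’ is always FALSE. Can you figure out the general rule? (NA * 0 is a tricky counterexample!) The result is not missing whenever it would be the same for every possible value of the NA. NA * 0 stays NA because the rule fails for it: Inf * 0 is NaN, not 0.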
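Each case from this question can be run directly in base R. The variable names are only labels for the example.

```r
power_zero <- NA ^ 0        # 1: true for every possible value
or_true    <- NA | TRUE     # TRUE: the other side decides the result
and_false  <- FALSE & NA    # FALSE: the other side decides the result
times_zero <- NA * 0        # NA: the answer depends on the missing value
inf_zero   <- Inf * 0       # NaN: the case that breaks "anything times 0 is 0"
```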
Note: For some context, see this thread
arrange()
Arrange flights by year, month, and day
arrange(flights, year, month, day)
Reorder rows in descending order of departure delay
arrange(flights, desc(dep_delay))
Create a tibble with a missing value
df <- tibble(x = c(5, 2, NA))
Missing values are sorted to the end
arrange(df, x)
Missing values stay at the end even with desc()
arrange(df, desc(x))
1: How could you use arrange() to sort all missing values to the start? (Hint: use is.na()). Note: This one should still have the earliest departure dates after the NAs. Hint: What does desc() do?
arrange(flights, desc(is.na(dep_delay)))
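A possible refinement, shown here on a made-up tibble rather than flights, is to add a second sort key so that the rows after the NAs are themselves in order. The tibble delays is invented for the sketch.

```r
library(dplyr)
library(tibble)

# Made-up delays with two missing values.
delays <- tibble(dep_delay = c(5, NA, -3, NA, 12))

# is.na() is TRUE for missing rows; desc() puts TRUE before FALSE.
# The second key sorts the remaining rows in ascending order.
sorted <- arrange(delays, desc(is.na(dep_delay)), dep_delay)
sorted
```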
2: Sort flights to find the most delayed flights. Find the flights that left earliest.
Most delayed flights
arrange(flights, desc(dep_delay))
Flights that left the earliest
arrange(flights, dep_delay)
This question is asking for the flights that were most delayed (left latest after scheduled departure time) and least delayed (left ahead of scheduled time).
3: Sort flights to find the fastest flights. Interpret fastest to mean shortest time in the air.
Fastest flights
arrange(flights, air_time)
Optional challenge: fastest flight could refer to fastest air speed. Speed is measured in miles per hour but time is minutes. Arrange the data by fastest air speed.
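One way to approach the optional challenge is to convert minutes to hours before dividing. The sketch below uses a small made-up tibble, trips, with the same distance (miles) and air_time (minutes) columns as flights; speed_mph is a name chosen for the example.

```r
library(dplyr)
library(tibble)

# Made-up trips: distance in miles, air_time in minutes.
trips <- tibble(
  flight   = c("A", "B", "C"),
  distance = c(500, 1000, 200),
  air_time = c(60, 150, 20)
)

# Miles per minute times 60 gives miles per hour.
fastest <- trips %>%
  mutate(speed_mph = distance / air_time * 60) %>%
  arrange(desc(speed_mph))
fastest
```

The same mutate() and arrange() steps should apply to flights unchanged, since it has columns with these names.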
4: Which flights travelled the longest? Which travelled the shortest?
Flights that travelled the longest
arrange(flights, desc(distance))
Flights travelled the shortest
arrange(flights, distance)
select()
Select columns by name
select(flights, year, month, day)
Select all columns between year and day (inclusive)
select(flights, year:day)
Select all columns except those from year to day (inclusive)
select(flights, -(year:day))
Rename variables
rename(flights, tail_num = tailnum)
Move variables to the start of the data frame
select(flights, time_hour, air_time, everything())
1: Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights. Find at least three ways.
Use select() with the column names
select(flights, dep_time, dep_delay, arr_time, arr_delay)
Use the starts_with() helper
select(flights, starts_with('dep'), starts_with('arr'))
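Use matches() with a regular expression. contains() does not work well here: no column name contains 'delays', and contains('time') also picks up sched_dep_time, sched_arr_time, air_time, and time_hour.
select(flights, matches('^(dep|arr)_(time|delay)$'))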
2: What happens if you include the name of a variable multiple times in a select() call?
Repeats are ignored: the variable appears only once in the result, at the position where it was first named.
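This is easy to confirm on a tiny made-up tibble. The names df_small and picked are only for the example.

```r
library(dplyr)
library(tibble)

df_small <- tibble(a = 1, b = 2, c = 3)

# b is named twice but appears once, where it was first mentioned.
picked <- select(df_small, b, a, b)
names(picked)   # "b" "a"
```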
3: What does the one_of() function do? Why might it be helpful in conjunction with this vector?
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
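one_of() selects the variables whose names are stored in a character vector. That is helpful with vars because the column names are held as strings in a vector, and it is easier to pass the vector than to retype every name. In current versions of the tidyverse, one_of() has been superseded by all_of() and any_of().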
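A sketch of the newer helpers, using a made-up one-row tibble that deliberately lacks arr_delay. The tibble small and the result names are assumptions for the example.

```r
library(dplyr)
library(tibble)

vars <- c("year", "month", "day", "dep_delay", "arr_delay")

# Made-up data with no arr_delay column.
small <- tibble(year = 2013, month = 1, day = 1, dep_delay = 2, carrier = "UA")

# any_of() selects the names that exist and skips the rest.
picked <- select(small, any_of(vars))
names(picked)

# all_of() is strict: it fails if any name is missing.
strict <- try(select(small, all_of(vars)), silent = TRUE)
inherits(strict, "try-error")   # TRUE
```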
4: Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
select(flights, contains("TIME"))
select(flights, contains('TIME'))
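Yes, because R is case sensitive but the select helpers are not: contains() ignores case by default. To change that, set ignore.case = FALSE, as in contains("TIME", ignore.case = FALSE).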
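The effect of ignore.case can be seen on a small made-up tibble with two "time" columns. The names times, loose and strict are only for the example.

```r
library(dplyr)
library(tibble)

times <- tibble(dep_time = 1, air_time = 2, carrier = "UA")

# Default: case is ignored, so "TIME" matches both time columns.
loose <- names(select(times, contains("TIME")))
loose    # "dep_time" "air_time"

# With ignore.case = FALSE nothing matches the upper-case string.
strict <- names(select(times, contains("TIME", ignore.case = FALSE)))
length(strict)   # 0
```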